-
Notifications
You must be signed in to change notification settings - Fork 390
trace: add barebones ptrace setup #4401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial review, lunchtime, will review more later
src/bin/miri.rs
Outdated
#[cfg(target_os = "linux")] | ||
if !miri_config.native_lib.is_empty() && !miri_config.force_old_native_lib { | ||
// FIXME: This should display a diagnostic / warning on error | ||
// SAFETY: No other threads have spawned yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a footgun. If we register the ctlrc handler before this then suddenly it's UB? Not sure how to improve this, maybe there's a way to ask the process how many threads it has or sth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, to be fair we can play a little fast and loose here. The docs for fork()
claim that after forking in a multithreaded context the child process must only do async-signal-safe things, but that's in a "it might be safe to do other things but it depends on what the other threads are doing" way, not in an "it's always UB to call async-signal-unsafe code" way. As long as we know what the ctrlc handler / other threads are doing, this is still sound to do in a multithreaded context. I'll check the code but given we're talking about an interrupt handler, I find it extremely unlikely that it's an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, leaving some more info on this safety comment about what you just told me works for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK fork
suspends/deletes all threads except for the one doing the fork
. So why does it matter what those other threads do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with fork
in a concurrent program is that those other threads may hold locks, which will now never be released. That's why async-signal-safety matters -- it's a lot like a signal handler, where the corresponding main thread may hold locks that will never get released while the signal handler runs.
src/shims/trace/mod.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RalfJung should we move this module and IsolatedAlloc into a separate crate within this repo that we depend on via a path dependency? Should speed up edit cycles by a small amount and generally improve separation (and thus allow unit testing nicely and potentially moving out of tree at some point). It worked for ui_test 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @RalfJung
thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea this lgtm. Really cool that it's possible
Think this all addresses the review, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please rebase and squash
I don't think it's worth a separate crate tbh. But I have also not looked into this PR at all yet.^^
|
Apply suggestions from code review Co-authored-by: Oli Scherer <github35764891676564198441@oli-obk.de> review comments fix possible hang
In transit currently so I can't build Miri locally to test this but if CI passes this should be fine, ty! Out of sheer curiosity regarding the code which caused the merge conflict, is there any semantic difference between |
Oh that was fallout from the metasized work. This may just be stuff from early drafts we forgot to fix later or it's something about inference. Unsure. But it should almost always be equivalent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have a bunch of questions and comments. Some of them are resolved in #4418, but a bunch are clarification questions, so I'd appreciate a follow-up PR extending the comments here and possibly doing some refactoring.
/// Returns a vector of page addresses managed by the allocator. | ||
pub fn pages(&self) -> Vec<usize> { | ||
let mut pages: Vec<_> = | ||
self.page_ptrs.clone().into_iter().map(|p| p.expose_provenance()).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allocates the buffer for these pages twice... that seems rather unnecessary.
@@ -227,10 +227,11 @@ impl rustc_driver::Callbacks for MiriCompilerCalls { | |||
} else { | |||
let return_code = miri::eval_entry(tcx, entry_def_id, entry_type, &config, None) | |||
.unwrap_or_else(|| { | |||
#[cfg(target_os = "linux")] | |||
miri::register_retcode_sv(rustc_driver::EXIT_FAILURE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this done here but not all the other places where we exit
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to have something to do with the way abort_if_errors()
exits not properly getting us a return code. I'm unsure of what the root cause is - I spent a lot of time trying to debug that but it seems to be the only thing that causes this behaviour, so I special-cased it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
abort_if_errors
doesn't exit, it unwinds. catch_with_exit_code
then turns that into an exit
.
I think you should be able to remove the random register_retcode
here, and then once #4406 lands add this inside the new fn exit
there.
Or does ptrace halt execution on an unwind? It should just resume execution then to let the program run its normal course.
// thread in an async-signal-unsafe way such as by accessing shared | ||
// semaphores, etc.; the handler only calls `sleep()` and `exit()`, which | ||
// are async-signal-safe, as is accessing atomics | ||
let _ = unsafe { miri::init_sv() }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we just ignore errors here...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it errored and the sv process failed to init, then we catch that when calling poll()
on it. I would like this to emit a diagnostic on error, though; it just didn't seem necessary to include that as part of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying we dynamically fall back to calling native code without fine-grained tracing if the setup fails?
That makes sense, but having a let _
then does not. Also, there should be a comment.
emit a diagnostic on error, though; it just didn't seem necessary to include that as part of this PR
If you have confusing things like let _
instead you need to explain those, or at least add FIXME(ptrace): emit a warning here
or so.
EDIT: There actually is a FIXME, fair. I think I'd have rather seen this staged a bit differently but that's getting into very subjective territory. ;)
/// receiving back events through `get_events`. | ||
/// | ||
/// # Safety | ||
/// The invariants for `fork()` must be upheld by the caller. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this safety comment. It should at least provide an easily clickable references to those fork
invariants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add that, sure thing.
// handler), they will not interact with anything on the main rustc/Miri | ||
// thread in an async-signal-unsafe way such as by accessing shared | ||
// semaphores, etc.; the handler only calls `sleep()` and `exit()`, which | ||
// are async-signal-safe, as is accessing atomics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is sounds like this relying on the ctrlc internal implementation details that could change any time...?
I also have no clue why we are talking about threads here, but that may be because init_sv
doesn't have a self-contained safety comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it somewhat hard to imagine how the ctrlc handler could realistically be changed to act in ways that may break the safety invariants for signal safety, but for clarity I'll add a better safety comment for init_sv
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main point is that I don't think it even matters what the ctrlc handler does. I think you misunderstood the safety requirements of fork
.
// reason to use a different one | ||
let mut curr_pid = init_pid; | ||
|
||
// There's an initial sigstop we need to deal with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does that come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
init_sv()
raises a sigstop on the child arm, to wait in place and make sure that the parent process can ptrace
it properly. If it can't, then the child is killed instead of being resumed from sigstop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we need documentation somewhere for the overarching protocol of who raises which signal when and what happens then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put that in trace/messages.rs
but I can expand on it if it's insufficient as-is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code here runs before the main event loop, the comment in messages.rs
explains the body of the loop IIUC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll put it in the docs for init_sv
then since that's the most obvious place I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's part of the protocol, I think it makes sense to have the entire protocol in one place somewhere. So I'd add it to messages.rs
. The text there could also make it more clear that those steps there are being repeated arbitrarily often, and which message types are used for which of these messages.
confirm_tx.send(Confirmation).unwrap(); | ||
// We can't trust simply calling `Pid::this()` in the child process to give the right | ||
// PID for us, so we get it this way | ||
curr_pid = wait_for_signal(None, signal::SIGSTOP, false).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who's ending this SIGSTOP and where? Seems like there's a protocol here that I haven't seen documented so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the sigstop that the child raises in start_ffi()
. Then calling ptrace::syscall()
is what ends it (it's the same as ptrace::cont()
but also pauses the child when it enters or exits a syscall; right now unused, but will be important for the malloc tracing later)
// Child entered a syscall; we wait for exits inside of this, so it | ||
// should never trigger on return from a syscall we care about |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comments. Who's waiting for what where and why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bah, this is wrong. I should have edited this when I gutted the malloc tracing bits, apologies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very surprised that this passed CI since we should be running this on macOS as well. Any idea what is happening here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cfg(linux)
'ed all of the trace-related bits since macOS has a very barebones (bad) ptrace implementation, so when running on it it'll just use the old behaviour
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, so macos doesn't get the fine-grained ptrace. But macos apparently still prints the error that should only be printed when using fine-grained ptrace. So there's clearly a bug here, and your reply doesn't explain it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shims/trace
is a bit ambiguous IMO, let's move this inside a native_lib
module.
Per the review in #4326, this implements the absolute bare minimum ptrace bits; no access is actually intercepted or logged, but when running in native-lib mode Miri will now fork itself and trace the child process unless this is disabled via CLI flag. Currently the docs are left the same as in the original PR i.e. they might reference things that aren't implemented yet, but I can change that.